Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
Gao, Wei, Zhou, Xinyu, Sun, Peng, Zhang, Tianwei, Wen, Yonggang
Key-Value cache (KV cache) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of KV cache to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream KV cache compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for KV cache compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative KV cache compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing KV cache can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing KV cache may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in KV cache compression when handling specific LLM tasks. Third, we provide tools to shed light on future KV cache compression studies and facilitate their practical deployment in production. They are open-sourced in https://github.com/LLMkvsys/rethink-kv-compression.
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Overview (1.00)
- Research Report > Promising Solution (0.34)
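To make the memory argument in the KV cache compression abstract above concrete, the following is a minimal sketch of how KV cache size is commonly estimated. It is not taken from the paper or its repository: the function name `kv_cache_bytes`, the model dimensions, and the 4-bit setting are hypothetical values chosen for illustration, and real quantization schemes add overhead for scales and zero points that is ignored here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2.0):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor 2 accounts for storing one key tensor and one
    value tensor per layer. bytes_per_elem is 2.0 for fp16 and 0.5 for
    an idealized 4-bit quantized cache (metadata overhead ignored).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Assumed dimensions, for illustration only.
fp16 = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1, bytes_per_elem=2.0)
int4 = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1, bytes_per_elem=0.5)

print(f"fp16 cache: {fp16 / GIB:.2f} GiB")
print(f"4-bit cache: {int4 / GIB:.2f} GiB")
```

The estimate covers memory only. As the abstract notes, a smaller cache does not by itself guarantee higher throughput or lower end-to-end latency.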
I-SIRch: AI-Powered Concept Annotation Tool For Equitable Extraction And Analysis Of Safety Insights From Maternity Investigations
Singh, Mohit Kumar, Cosma, Georgina, Waterson, Patrick, Back, Jonathan, Jun, Gyuchan Thomas
Maternity care is a complex system involving treatments and interactions between patients, providers, and the care environment. To improve patient safety and outcomes, understanding the human factors (e.g. individuals' decisions, local facilities) influencing healthcare delivery is crucial. However, most current tools for analysing healthcare data focus only on biomedical concepts (e.g. health conditions, procedures and tests), overlooking the importance of human factors. We developed a new approach called I-SIRch, using artificial intelligence to automatically identify and label human factors concepts in maternity healthcare investigation reports describing adverse maternity incidents produced by England's Healthcare Safety Investigation Branch (HSIB). These incident investigation reports aim to identify opportunities for learning and improving maternal safety across the entire healthcare system. I-SIRch was trained using real data and tested on both real and simulated data to evaluate its performance in identifying human factors concepts. When applied to real reports, the model achieved a high level of accuracy, correctly identifying relevant concepts in 90% of the sentences from 97 reports. Applying I-SIRch to analyse these reports revealed that certain human factors disproportionately affected mothers from different ethnic groups. Our work demonstrates the potential of using automated tools to identify human factors concepts in maternity incident investigation reports, rather than focusing solely on biomedical concepts. This approach opens up new possibilities for understanding the complex interplay between social, technical, and organisational factors influencing maternal safety and population health outcomes. By taking a more comprehensive view of maternal healthcare delivery, we can develop targeted interventions to address disparities and improve maternal outcomes.
- Europe > United Kingdom > England > Leicestershire > Loughborough (0.05)
- Europe > United Kingdom > Wales (0.04)
- Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (1.00)
- Health & Medicine > Public Health > Maternal Health (1.00)
- Health & Medicine > Consumer Health (1.00)
- Information Technology > Human Computer Interaction (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
- Information Technology > Data Science > Data Mining (0.87)
Attention Mechanism for Lithium-Ion Battery Lifespan Prediction: Temporal and Cyclic Attention
Lee, Jaewook, Heo, Seongmin, Lee, Jay H.
Accurately predicting the lifespan of lithium-ion batteries (LIBs) is pivotal for optimizing usage and preventing accidents. Previous approaches often relied on inputs that are challenging to measure in real time, and failed to capture intra- and inter-cycle data patterns simultaneously. Our study employs attention mechanisms (AM) to develop data-driven models that predict LIB lifespan using easily measurable inputs. The developed model integrates a recurrent neural network and a convolutional neural network, featuring two types of AMs: temporal attention (TA) and cyclic attention (CA). TA identifies important time steps within each cycle, while CA strives to capture key features of inter-cycle correlations through self-attention (SA). We apply the developed model to publicly available data consisting of three batches of cycling modes. TA scores highlight the rest phase as a key characteristic to distinguish different batches. By leveraging CA scores, we decreased the input dimension from 100 cycles to 50 and 30 cycles with single- and multi-head attention.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Asia > South Korea > Daejeon > Daejeon (0.04)
- Energy > Energy Storage (1.00)
- Electrical Industrial Apparatus (1.00)
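The battery lifespan abstract above describes using self-attention scores across cycles to shrink the input from 100 cycles to fewer. The sketch below illustrates that general idea with plain scaled dot-product self-attention and is not the authors' model: it omits learned query/key/value projections, the recurrent and convolutional layers, and multi-head attention, and the names `self_attention_scores` and `top_cycles` and the toy feature vectors are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_scores(features):
    """Return a matrix whose row i holds the attention weights that
    cycle i assigns to every cycle (scaled dot-product, Q = K = input).
    """
    scale = math.sqrt(len(features[0]))
    scores = []
    for q in features:
        logits = [sum(a * b for a, b in zip(q, k)) / scale
                  for k in features]
        scores.append(softmax(logits))
    return scores


def top_cycles(features, keep):
    """Rank cycles by the average attention they receive and return
    the indices of the `keep` highest-ranked cycles, in cycle order.
    """
    scores = self_attention_scores(features)
    n = len(features)
    received = [sum(scores[i][j] for i in range(n)) / n
                for j in range(n)]
    ranked = sorted(range(n), key=lambda j: received[j], reverse=True)
    return sorted(ranked[:keep])


# Toy per-cycle feature vectors (assumed values, for illustration only).
cycles = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
print(top_cycles(cycles, keep=1))
```

Ranking by average received attention is one simple selection rule among several possible ones; the abstract does not specify how the CA scores were used to choose the retained cycles.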
EETimes - Groq's AI Chip Debuts in the Cloud
Groq's tensor streaming processor (TSP) silicon is now available to accelerate customers' AI workloads in the cloud. Cloud service provider Nimbix now offers machine learning acceleration on Groq hardware as an on-demand service for "selected customers" only. While there are several startups building AI silicon for the data center, Groq now joins Graphcore as one of only two with accelerators commercially available for customers to use as part of a cloud service. Graphcore previously announced that its accelerators are available as part of Microsoft Azure. "Groq's simplified processing architecture is unique, providing unprecedented, deterministic performance for compute intensive workloads, and is an exciting addition to our cloud-based AI and Deep Learning platform," said Steve Hebert, Nimbix's CEO.